Pandas: Working with time series data

© 2016, Joris Van den Bossche and Stijn Van Hoey (mailto:jorisvandenbossche@gmail.com, mailto:stijnvanhoey@gmail.com). Licensed under CC BY 4.0 Creative Commons



In [ ]:
%matplotlib notebook
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.options.display.max_rows = 8

Introduction: datetime module

Standard Python contains the datetime module to handle with date and time data:


In [ ]:
import datetime

In [ ]:
dt = datetime.datetime(year=2016, month=12, day=19, hour=13, minute=30)
dt

In [ ]:
print(dt) # .day,...

In [ ]:
print(dt.strftime("%d %B %Y"))

Dates and times in pandas

The Timestamp object

Pandas has its own date and time objects, which are compatible with the standard datetime objects, but provide some more functionality to work with.

The Timestamp object can also be constructed from a string:


In [ ]:
ts = pd.Timestamp('2016-12-19')
ts

Like with datetime.datetime objects, there are several useful attributes available on the Timestamp. For example, we can get the month:


In [ ]:
ts.month

In [ ]:
ts + pd.Timedelta('5 days')

Parsing datetime strings

Unfortunately, when working with real world data, you encounter many different datetime formats. Most of the time when you have to deal with them, they come in text format, e.g. from a CSV file. To work with those data in Pandas, we first have to parse the strings to actual Timestamp objects.

REMEMBER:

from string formatted dates to Timestamp objects: `to_datetime` function

In [ ]:
pd.to_datetime("2016-12-09")

In [ ]:
pd.to_datetime("09/12/2016")

In [ ]:
pd.to_datetime("09/12/2016", dayfirst=True)

In [ ]:
pd.to_datetime("09/12/2016", format="%d/%m/%Y")

A detailed overview of how to specify the format string, see the table in the python documentation: https://docs.python.org/3.5/library/datetime.html#strftime-and-strptime-behavior

Timestamp data in a Series or DataFrame column


In [ ]:
s = pd.Series(['2016-12-09 10:00:00', '2016-12-09, 11:00:00', '2016-12-09 12:00:00'])

The to_datetime function can also be used to convert a full series of strings:


In [ ]:
ts = pd.to_datetime(s)

In [ ]:
ts

Notice the data type of this series: the datetime64[ns] dtype. This indicates that we have a series of actual datetime values.

The same attributes as on single Timestamps are also available on a Series with datetime data, using the .dt accessor:


In [ ]:
ts.dt.hour

In [ ]:
ts.dt.weekday

To quickly construct some regular time series data, the pd.date_range function comes in handy:


In [ ]:
pd.Series(pd.date_range(start="2016-01-01", periods=10, freq='3H'))

Time series data: Timestamp in the index

River discharge example data

For the following demonstration of the time series functionality, we use a sample of discharge data of the Maarkebeek (Flanders) with 3 hour averaged values, derived from the Waterinfo website.


In [ ]:
data = pd.read_csv("data/flowdata.csv")

In [ ]:
data.head()

We already know how to parse a date column with Pandas:


In [ ]:
data['Time'] = pd.to_datetime(data['Time'])

With set_index('datetime'), we set the column with datetime values as the index, which can be done by both Series and DataFrame.


In [ ]:
data = data.set_index("Time")

In [ ]:
data

The steps above are provided as built-in functionality of read_csv:


In [ ]:
data = pd.read_csv("data/flowdata.csv", index_col=0, parse_dates=True)
REMEMBER:

`pd.read_csv` provides a lot of built-in functionality to support this kind of transactions when reading in a file! Check the help of the read_csv function...

The DatetimeIndex

When we ensure the DataFrame has a DatetimeIndex, time-series related functionality becomes available:


In [ ]:
data.index

Similar to a Series with datetime data, there are some attributes of the timestamp values available:


In [ ]:
data.index.day

In [ ]:
data.index.dayofyear

In [ ]:
data.index.year

The plot method will also adapt it's labels (when you zoom in, you can see the different levels of detail of the datetime labels):


In [ ]:
data.plot()

We have to much data to sensibly plot on one figure. Let's see how we can easily select part of the data or aggregate the data to other time resolutions in the next sections.

Selecting data from a time series

We can use label based indexing on a timeseries as expected:


In [ ]:
data[pd.Timestamp("2012-01-01 09:00"):pd.Timestamp("2012-01-01 19:00")]

But, for convenience, indexing a time series also works with strings:


In [ ]:
data["2012-01-01 09:00":"2012-01-01 19:00"]

A nice feature is "partial string" indexing, where we can do implicit slicing by providing a partial datetime string.

E.g. all data of 2013:


In [ ]:
data['2013']

Normally you would expect this to access a column named '2013', but as for a DatetimeIndex, pandas also tries to interprete it as a datetime slice.

Or all data of January up to March 2012:


In [ ]:
data['2012-01':'2012-03']
EXERCISE:
  • select all data starting from 2012

In [ ]:
# %load snippets/05 - Time series data36.py
EXERCISE:
  • select all data in January for all different years

In [ ]:
# %load snippets/05 - Time series data37.py
EXERCISE:
  • select all data in January, February and March for all different years

In [ ]:
# %load snippets/05 - Time series data38.py

In [ ]:
data = data.drop("months", axis=1)
EXERCISE:
  • select all 'daytime' data (between 8h and 20h) for all days

In [ ]:
# %load snippets/05 - Time series data40.py

In [ ]:
# %load snippets/05 - Time series data41.py

The power of pandas: resample

A very powerfull method is resample: converting the frequency of the time series (e.g. from hourly to daily data).

The time series has a frequency of 1 hour. I want to change this to daily:


In [ ]:
data.resample('D').mean().head()
NOTE:
  • with older versions of pandas, data.resample('D').mean() was expressed as data.resample('D', how='mean').

Other mathematical methods can also be specified:


In [ ]:
data.resample('D').max().head()
REMEMBER:

The string to specify the new time frequency: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases

These strings can also be combined with numbers, eg `'10D'`...

In [ ]:
data.resample('A').mean().plot() # 10D
EXERCISE:
  • plot the monthly standard deviation of the columns

In [ ]:
# %load snippets/05 - Time series data45.py
EXERCISE:
  • plot the monthly mean and median values for the years 2011-2012 for 'L06_347'

**Note**
You can create a new figure with `fig, ax = plt.subplots()` and add each of the plots to the created `ax` object (see documentation of pandas plot function)

In [ ]:
# %load snippets/05 - Time series data46.py

In [ ]:
# %load snippets/05 - Time series data47.py
EXERCISE:
  • plot the monthly mininum and maximum daily average value of the 'LS06_348' column

In [ ]:
# %load snippets/05 - Time series data48.py

In [ ]:
# %load snippets/05 - Time series data49.py
EXERCISE:
  • make a bar plot of the mean of the stations in year of 2013 (Remark: create a `fig, ax = plt.subplots()` object and add the plot to the created ax

In [ ]:
# %load snippets/05 - Time series data50.py

In [ ]:

EXERCISE:
  • Calculate the typical yearly pattern with monthly resolution (plot of the typical monthly average over the years)

In [ ]:


In [ ]: